fix: Summit cold start from checkpoint deadlock#131
fix: Summit cold start from checkpoint deadlock#131HenryMBaldwin merged 14 commits intoh/staking-and-joiningfrom
Conversation
…forkchoice if syncing is required
daltoncoder
left a comment
There was a problem hiding this comment.
Nice looks good just some minor comments
| } else { | ||
| warn!( | ||
| ?status, | ||
| "unexpected response to initial forkchoice update, proceeding anyway" | ||
| ); | ||
| break; | ||
| } |
There was a problem hiding this comment.
The only other status not covered in the other branches is Invalid
/// INVALID is returned by the engine API in the following calls:
/// - forkchoiceUpdate: if the new head is unknown, pre-merge, or reorg to it fails
Since this is on startup this is going to be unrecoverable for a node and will crash as soon as he gets a block. Lets just panic here now with a helpful message something like "Finalizer started with invalid forkchoice"
finalizer/src/actor.rs
Outdated
| height = block.height(), | ||
| "execution client returned SYNCING, sending forkchoice update to trigger sync and retrying..." | ||
| ); | ||
| engine_client.commit_hash(state.forkchoice).await; |
There was a problem hiding this comment.
We discussed this in the office but this commit is not needed and we should remove
finalizer/src/actor.rs
Outdated
| (true, false) => { | ||
| warn!( | ||
| new_height, | ||
| "payload valid but parent hash mismatch, not executing" | ||
| ); | ||
| } |
There was a problem hiding this comment.
As far as i can tell this branch is unreachable. Waiting for @matthias-wright to also check this out but if that is the case this whole thing can be simplified to just check payload_status.is_valid()
| // Response override queues | ||
| check_payload_overrides: VecDeque<PayloadStatus>, | ||
| commit_hash_overrides: VecDeque<ForkchoiceUpdated>, |
There was a problem hiding this comment.
Cool nice solution to fixing up the MockEngineClient
node/src/engine.rs
Outdated
| pub const BLOCKS_PER_EPOCH: u64 = 10; | ||
| #[cfg(all(not(debug_assertions), not(feature = "e2e")))] | ||
| const BLOCKS_PER_EPOCH: u64 = 10000; | ||
| const BLOCKS_PER_EPOCH: u64 = 50; |
There was a problem hiding this comment.
lets revert this until the PR that adds this to Genesis file. It will break some test binaries we have
Issue
When starting a wiped node from a summit checkpoint where the execution client has clean state, summit deadlocks (irrecoverably AFAIK) because it treats
SYNCINGfrom the execution client as a failure.Solution
This is solved with an initial syncing phase, prompted by sending an initial forkchoice update to the execution client to set the sync target, and polling until it's done syncing.
We also attempt to sync in execute block if exec client returns syncing. This situation means something has likely gone catastrophically wrong, and should never happen with our reth client, so we log an error instead of a warn while attempting to wait for exec client to recover.
Notably, if the execution client doesn't have any peers or is unable to sync for any reason, this is also effectively stalls the node. IMO this is still a strict improvement over previous behavior.